ABSTRACT



A key frame representative of a sequence of frames in a
video file is selected by applying face detection to the video
to find a frame which may include people. The technique has
particular application to indexing video files located by a
search engine web crawler. A key frame, one frame
representative of a video file, is extracted from the sequence of
frames. The sequence of frames may include multiple scenes
or shots, for example, continuous motions relative to a
camera separated by transitions, cuts, fades and dissolves.
To extract a key frame, face detection is performed in each
frame and a key frame is selected from the sequence of
frames based on a sum of detected faces in the frame.

29 Claims, 15 Drawing Sheets 
KEYFRAME SELECTION TO REPRESENT A VIDEO

BACKGROUND OF THE INVENTION

The World Wide Web ("WWW") is comprised of millions of documents (web pages) formatted in Hypertext Markup Language ("HTML"), which can be accessed by thousands of users through the Internet. To access a web page, its Uniform Resource Locator ("URL") must be known. Search engines index web pages and make those URLs available to users of the WWW. To generate an index, a search engine may search the WWW for new web pages using a web crawler. The search engine selects relevant information from a web page after analyzing the content of the web page and saves the relevant information and the web page's URL in the index.

Web pages also contain links to other documents on the WWW, for example, text documents and image files. By searching web pages for links to image files, a search engine connected to the WWW provides an index of image files located on the WWW. The index contains a URL and a representative image from the image file.

Web pages also contain links to multimedia files, such as video and audio files. By searching web pages for links to multimedia files, a multimedia search engine connected to the WWW, such as Scour Inc.'s SCOUR.NET, provides an index of multimedia files located on the WWW. SCOUR.NET's index for video files provides text describing the contents of the video file and the URL for the multimedia file. Another multimedia search engine, WebSEEK, summarizes a video file by generating a highly compressed version of the video file. The video file is summarized by selecting a series of frames from shots or scenes in the video file and repackaging the frames as an animated GIF file. WebSEEK also generates a color histogram from each shot in the video to automatically classify the video file and allow content-based visual queries. It is described in John R. Smith et al., "An Image and Video Search Engine for the World-Wide Web", Symposium on Electronic Imaging: Science and Technology - Storage and Retrieval for Image and Video Databases V, San Jose, Calif., February 1997, IS&T/SPIE.

Finding a representative image of a video to display is very subjective. Also, analyzing the contents of digital video files linked to web pages is difficult because of the low quality and low resolution of the highly compressed digital video files.

SUMMARY OF THE INVENTION

One technique for finding a representative image of a video to display is to find a frame which is likely to include people. This technique is described in co-pending U.S. patent application Ser. No. 09/248,545 entitled "System for Selecting a Keyframe to Represent a Video" by Frederic Dufaux et al. The likelihood of people in a frame is determined by measuring the percentage of skin-color in the frame. Skin-color detection is a learning-based system trained on large amounts of labeled data sampled from the WWW. Skin color detection returns, for each frame in the shot, the percentage of pixels classified as skin.

The present invention provides a mechanism for selecting a representative image from a video file by providing a technique for applying face detection to a video to select a key frame which may include people and has particular application to indexing video files located by a search engine web crawler. A key frame, one frame representative of a video file, is extracted from the sequence of frames. The sequence of frames may include multiple scenes or shots, for example, continuous motions relative to a camera separated by transitions, cuts, fades and dissolves. To extract a key frame, face detection is performed in each frame and a key frame is selected from the sequence of frames based on a sum of detected faces in the frame.

Face detection in a frame may be performed by creating a set of images for the frame. Each image in the set of images is smaller than the previous image. Each image is smaller than the previous image by the same scale factor. Selected ones of the set of images are searched for faces. The selected images are dependent on the minimum size face to detect. The validity of a detected face is ensured by tracking overlap of a detected face in consecutive frames.

Shot boundaries may be detected in the sequence of frames. A key shot is selected from shots within the detected boundaries based on the number of detected faces in the shot. A shot score may be provided for each detected shot. The shot score is based on a set of measures. The measures may be selected from the group consisting of motion between frames, spatial activity between frames, skin pixels, detected faces and shot length. Each measure includes a respective weighting factor. The weighting factor is dependent on the level of confidence of the measure.

Face detection may process different size frames by modifying the size of the frame before performing the face detection.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

FIG. 1 illustrates components of a multimedia search engine connected to the World Wide Web for generating an index of multimedia files including an extracted key frame for a video file;

FIG. 2 is a flowchart showing the steps for creating an index of multimedia files including the file's URL and a key frame;

FIG. 3 is a flowchart showing the steps for the step of downloading multimedia files shown in FIG. 2;

FIG. 4 is a flowchart showing the steps for the step of extracting a key frame shown in FIG. 2;

FIG. 5 is a flowchart showing the steps for the step of computing frame measurements shown in FIG. 4;

FIG. 6 is a flowchart illustrating a method for detecting one or more faces in a frame of a video according to the principles of the present invention;

FIG. 7 illustrates the pyramid or set of images created from the input image 700;

FIG. 8 illustrates the different face sizes detected in each of images 700, 702a-e, in the pyramid shown in FIG. 7;

FIG. 9A illustrates a method for reducing false positives by tracking a detected face across several consecutive frames;

FIG. 9B is a flowchart illustrating the method for tracking a detected face;

FIGS. 10A-10E illustrate luminance histograms and P_ks measurements which are described in conjunction with FIG. 5;

FIG. 11 is a graph of pixel-wise difference values for
successive frames; 

FIG. 12 is a flowchart illustrating the steps for detecting 
shot boundaries; 

FIGS. 13A-C illustrate the type of shot boundaries 
detected; 

FIG. 14 is a flowchart illustrating the steps for selecting 
a key shot; 

FIG. 15 is a flowchart illustrating the steps for selecting 
a key frame in the key shot. 

DETAILED DESCRIPTION OF THE
INVENTION 

FIG. 1 illustrates a WWW-connected search engine 
including a webcrawler 122, a web server 124 for allowing 
web users to access an index 118, and a multimedia index 
system 100 for creating the index 118 of multimedia files. 
The crawler system 122, separate from the multimedia index 
system 100, is connected to the WWW and crawls the 
WWW searching for web pages containing URLs to multi- 
media files. The crawler system extracts key text, deter- 
mined to be relevant, from the web page and stores the text, 
the web page's URL, and the URLs of any multimedia files 
found on the web page. The components of the multimedia 
index system 100 for extracting representations of the mul- 
timedia files and classifying files include a librarian 108 for 
keeping track of data and controlling workflow in the 
system, daemons 104, 106, 110, 112, 116, and 120 for 
performing work in the system and a media server 114. 

The librarian 108 is a relational database. The daemons 
query the librarian 108 for work to perform and add to the 
librarian 108 work for other daemons to perform. The 
system daemons include a starter daemon 104, a getter 
daemon 106, a keyframer daemon 110, an audio classifier 
daemon 112, a reaper daemon 120 and a mover daemon 116. 
There may be multiple copies of each type of daemon, 
allowing the system to scale to index a large number of 
multimedia files. The operation of the components of the 
multimedia index system 100 is described later in conjunc- 
tion with FIG. 2. 
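
The polling pattern described above, in which daemons query the librarian for work and add work for other daemons, can be sketched with a small relational table. This is only an illustrative sketch, not the system's actual schema: the table name `work`, its columns, and the helper functions `open_librarian`, `add_work` and `poll_work` are all assumptions introduced here, and an in-memory SQLite database stands in for the librarian 108.

```python
import sqlite3


def open_librarian():
    """Create an in-memory stand-in for the librarian (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE work ("
        "id INTEGER PRIMARY KEY, daemon TEXT, url TEXT, done INTEGER DEFAULT 0)"
    )
    return db


def add_work(db, daemon, url):
    """Queue a work item for the named daemon type."""
    db.execute("INSERT INTO work (daemon, url) VALUES (?, ?)", (daemon, url))
    db.commit()


def poll_work(db, daemon):
    """Return (id, url) of the oldest pending item for this daemon, or None."""
    row = db.execute(
        "SELECT id, url FROM work WHERE daemon = ? AND done = 0 "
        "ORDER BY id LIMIT 1",
        (daemon,),
    ).fetchone()
    if row is not None:
        # Mark the item as taken so another copy of the daemon skips it.
        db.execute("UPDATE work SET done = 1 WHERE id = ?", (row[0],))
        db.commit()
    return row


db = open_librarian()
add_work(db, "getter", "http://example.com/clip.avi")  # starter daemon adds work
item = poll_work(db, "getter")                         # getter daemon polls
add_work(db, "keyframer", item[1])                     # video file: work for keyframer
```

Because each daemon only reads and writes rows in the shared table, several copies of the same daemon type could poll it independently, which is consistent with the scaling behaviour described above.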

At step 200 in FIG. 2, a starter daemon 104 in the 
multimedia index system 100 periodically checks to see if 
the crawler system has identified multimedia URLs to be 
downloaded. If there are multimedia URLs to be 
downloaded, the starter daemon 104 downloads the multi- 
media URLs and relevant text from the crawler system, and 
puts them into the librarian 108. The addition of multimedia 
URLs to the librarian 108 by the starter daemon 104 creates 
work for a getter daemon 106. 

At step 202, a getter daemon 106 periodically checks with
the librarian 108 to determine if there are multimedia URLs
to be processed. The getter daemon 106, using the multi- 
media URLs downloaded by the starter daemon 104, down- 
loads the multimedia files. Step 202 is described in greater 
detail later in conjunction with FIG. 3. 

At step 204, if the multimedia file is a video file, the getter 
daemon 106 adds work to the librarian 108 for the keyframer 
daemon 110. If the multimedia file is an audio file, the getter
daemon 106 adds work to the librarian 108 for the audio 
classification daemon. 

At step 208, the audio classification daemon periodically 
polls the librarian 108 to determine if there are requests for 
classification of an audio file. The audio classification dae- 
mon analyzes the audio file, and classifies the audio file as 
either music or speech, and stores the classification with the 
audio file and the audio file's URL in the media server 114. 
At step 206, the keyframer daemon 110 periodically polls
the librarian 108 to determine if there are requests for 
generating a representation for a video file. The keyframer 
daemon analyzes the video file and extracts a representation 
from the video file. The representation extracted is a key 
frame. After the key frame is extracted, the keyframer 
daemon 110 adds work to the librarian 108 for the mover 
daemon 116 and the reaper daemon 120. 

At step 210, the mover daemon 116 periodically polls the 
librarian 108 for work. Finding work created by the audio 
classification daemon 112 or the keyframer daemon 110, the 
mover daemon 116 moves the audio classification produced 
by the audio classification daemon or the keyframe produced 
by the keyframer daemon 110 to the index of multimedia 
files 118 which is available to the web server 124. 

At step 212, the reaper daemon 120 periodically polls the 
librarian 108 for work. Finding work created by the key- 
framer daemon 110, the reaper daemon 120 deletes the video 
file representative text and URL downloaded by the starter 
daemon 104 and the video file downloaded by the getter 
daemon 106. These files and representations are no longer 
required by the multimedia system because all work depend- 
ing on them has been completed. 

At step 300 in FIG. 3, the getter daemon 106 downloads
a multimedia file from the multimedia URL as discussed
above in conjunction with FIG. 2.

At step 302, after the multimedia file has been 
downloaded, the getter daemon 106 determines the format 
of the multimedia file. Digital video files linked to web
pages may be in many different formats, including Audio 
Video Interleave ("AVI"), Advanced Streaming Format 
("ASF"), RealAudio, MPEG and Quicktime. The getter 
daemon 106 transcodes the digital video files to a common 
digital video format, for example, AVI format. After the 
transcoding, the getter daemon 106 stores the common 
format digital video file and a meta-data file for the digital
video file. The meta-data file includes information on the 
digital video file, such as the title, author, copyright and 
video frame rate. 

At step 306 the meta-data file and the common video 
format file are stored on local storage. 

Step 206 in FIG. 2 is described in greater detail later in 
conjunction with FIG. 4. FIG. 4 illustrates a high level 
flowchart showing the steps the keyframer daemon 110 
performs to select a key frame for the video sequence in 
common video format. 

At step 400 a number of measures are computed and 
stored for every frame in the video sequence. The measures 
include motion, spatial activity (entropy), skin color and face
detection. The sequence of frames may be grouped into a 
sequence of shots. A shot is a sequence of frames resulting 
from a continuous operation of the camera in which there is 
no significant change between pairs of successive frames. 

At step 402, shot boundaries are detected in the video 
sequence. A shot boundary is detected by detecting a sig- 
nificant change between successive frames. The shot bound- 
aries are detected dependent on the measures computed at 
step 400. After the shot boundaries have been detected, a 
most interesting shot is selected from the video sequence at 
step 404 dependent on measures including motion activity, 
entropy, face detection, skin color and length of the shot. 
After the shot has been selected, a key frame is selected from 
within the selected shot at step 406 dependent on measures 
including motion activity, skin pixels, face detection and 
entropy. 
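
The selection at steps 404 and 406 can be pictured as a weighted combination of per-shot (or per-frame) measures. The sketch below is an assumption-laden illustration only: the text names the measures but gives no numeric weights, no normalisation, and no direction for each measure's contribution, so the weight values in `SHOT_WEIGHTS`, the assumption that measures are scaled to [0, 1], and the function names are all hypothetical.

```python
# Hypothetical weights; the text does not give numeric values.
SHOT_WEIGHTS = {
    "motion": 1.0,
    "entropy": 1.0,
    "faces": 2.0,
    "skin": 1.0,
    "length": 0.5,
}


def weighted_score(measures, weights):
    """Sum of weight * measure over the named measures (missing ones count as 0)."""
    return sum(weights[name] * measures.get(name, 0.0) for name in weights)


def select_key_shot(shots):
    """Return the index of the shot with the highest weighted score.

    `shots` is a list of dicts of per-shot measures, assumed here to be
    already normalised to the range [0, 1].
    """
    return max(
        range(len(shots)),
        key=lambda i: weighted_score(shots[i], SHOT_WEIGHTS),
    )


shots = [
    {"motion": 0.2, "entropy": 0.5, "faces": 0.0, "skin": 0.1, "length": 0.9},
    {"motion": 0.4, "entropy": 0.6, "faces": 1.0, "skin": 0.7, "length": 0.3},
]
key_shot = select_key_shot(shots)  # second shot wins: it contains detected faces
```

The same scoring function could be reused at step 406 with a different weight table restricted to motion activity, skin pixels, face detection and entropy.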

Step 400 in FIG. 4 is described in greater detail later in 
conjunction with FIG. 5. FIG. 5 illustrates the measures that
are computed for each frame in order to select a key frame from a digital video file. Successive frames in the same shot in a digital video file have the same or continuously varying camera viewpoint with the only difference between the frames being due to object motion or camera action. Object motion may be, for example, a person walking, and a camera action may be a pan or a zoom; either results in changes in successive frames.

At step 500, face detection is performed for each frame in the video sequence. Face detection is described in conjunction with FIG. 6. The presence of people in static photographs has been detected through the use of face detection. A method for performing neural network face detection in a photograph is described in "Neural Network-Based Face Detection", by H. A. Rowley et al. in IEEE Trans. on PAMI, 20(1):23-38, 1998 which is incorporated herein by reference in its entirety. The method described by H. A. Rowley et al. is an upright face detection system. A retinally connected neural network examines small fixed size windows of an image and determines whether each window contains a face. The system arbitrates between multiple networks to improve performance over a single network. To detect faces larger than the window size in the static image, a set of reduced size images based on the static image is generated. The set of reduced size images is created by repeatedly reducing the size of the previous image in the set of reduced size images. An image is reduced in size by subsampling the previous image. Face detection is applied to each image in the set of images by applying a neural network-based algorithm on a fixed size window which is moved across the image one pixel at a time. The window must be a fixed size because the algorithm is trained to recognize faces located within the window.

The window of the image is pre-processed by equalizing the intensity values across the window in order to compensate for lighting conditions. Then histogram equalization is performed to compensate for differences in camera input gains and to improve contrast. The pre-processed window is passed through a neural network. The neural network has multiple types of hidden units. The hidden units include units which look at 10x10 pixel subregions, 5x5 pixel subregions and overlapping 20x5 pixel horizontal stripes of pixels. Each hidden unit detects features that may be important for face detection, for example, mouths, pairs of eyes, individual eyes, the nose and corners of the mouth. The neural network has a single, real-valued output which indicates whether or not the window contains a face.

All images in the set of images are searched for frontal faces. This is a very time consuming process in which it can take up to four minutes to process a 320x240 pixel image. A video includes a sequence of images to search for faces which are likely not to be frontal faces because people in a video do not tend to look directly at the camera. Also, searching the contents of digital video files linked to web pages for faces is difficult because of the low quality and low resolution of the highly compressed digital video files.

FIG. 6 is a flowchart illustrating a method for detecting one or more faces in a frame of a video according to the principles of the present invention.

At step 600, the face detector computes a scale factor dependent on the frame size of the input image. The frame size for videos stored on the Internet is not fixed; thus, face detection can be applied to videos with any size frame by computing the scale factor dependent on the frame size of the original image. Processing continues with step 602.

At step 602, in order to detect different size faces in the input image, a low-pass pyramid is built from the input image. A low-pass pyramid is a set of reduced size images created from the input image. The input image is at the top of the pyramid. The set of images is created from the input image by decreasing the size of the input image by a scaling factor. Each image in the set of images is created by decreasing the size of the previous image by the same scaling factor. For example, each image in the set of images can be 90% of the size of the previous image. The pyramid scaling factor is determined using the following equation:

scale Factor = [equation illegible in source]

where:

scale_start is the level of the pyramid in which to start searching.

area_min is set to 0.12 in order to look for a face in an area greater than 12% of the image.

H is the height of the input image.

W is the width of the input image.

[...]

FIG. 7 illustrates the pyramid or set of images created from the input image 700. The input image 700 is the top of the pyramid. Image 702a is created from the input image 700 by reducing the size of input image 700 by a scaling factor. Image 702b is created by reducing the size of image 702a by the same scaling factor. Image 702c is created by reducing the size of image 702b by the same scaling factor, image 702d is created by reducing the size of image 702c by the same scaling factor and image 702e is created by reducing the size of image 702d by the same scaling factor. The size of the previous image is reduced using sampling techniques well-known in the art. The Scale Factor is dependent on the size of the original image. The pyramid [...] Returning to FIG. 6, processing continues with step 604.

At step 604, one of the images 700, 702a-e in the pyramid is selected to be searched for faces. In order to determine if the frame includes people, it is not necessary to detect all faces in the frame. Also, frames having only small faces are not likely to be representative of the video. Thus, face detection is performed in only a portion of the pyramid. The starting image is selected dependent on the minimum size face to find. For example, only faces greater than 12% of the total original frame may be interesting and thus the starting image is selected to find faces which are 12% of the total input image. Thus, the search can be limited to a number of levels in order to look for larger faces in only smaller images. However, all levels are computed even though they are not searched because it is relatively inexpensive to compute each level. In an alternative embodiment, only the levels to be searched may be computed. The number of levels of the pyramid to search is dependent on a scale_end parameter and a scale_interval parameter. Scale_end is the level of the pyramid in which to end searching and scale_interval is the number of levels to go down after each search. Providing the ability to select a portion of the levels of the pyramid reduces the processing time because face detection is more time consuming at higher levels. In an embodiment for detecting faces greater than 12% of the image, if the scaling factor is selected to be 90%, face detection is performed in levels 4-6 of the pyramid by setting scale_start to level 4 and scale_end to level 6. Processing continues with step 606.
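
The pyramid geometry and level selection described for steps 602 and 604 can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions rather than the detector's actual code: it assumes level 0 is the input image, that each level is the input scaled by the scaling factor raised to the level number, that `scale_interval` is the step between searched levels, and that the detector window is 20 pixels wide; the function names are hypothetical.

```python
WINDOW = 20  # assumed side length in pixels of the fixed detector window


def pyramid_sizes(width, height, scale_factor, num_levels):
    """Return (width, height) per level; level 0 is the input image (assumption)."""
    sizes = []
    for level in range(num_levels):
        s = scale_factor ** level
        sizes.append((max(1, round(width * s)), max(1, round(height * s))))
    return sizes


def levels_to_search(scale_start, scale_end, scale_interval=1):
    """Levels visited when stepping scale_interval levels after each search."""
    return list(range(scale_start, scale_end + 1, scale_interval))


def face_fraction(size, window=WINDOW):
    """Fraction of the shorter image side covered by the detector window."""
    width, height = size
    return window / min(width, height)


sizes = pyramid_sizes(320, 240, 0.9, 7)
for level in levels_to_search(4, 6):
    print(level, sizes[level], round(face_fraction(sizes[level]), 3))
```

For a 320x240 input and a 90% scaling factor, the fixed window covers a larger share of each successively smaller image, which is why starting the search at a deeper level skips small faces.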
At step 606, a neural network based algorithm to detect faces is applied to the frame. The neural network based algorithm is applied on a block of 20x20 pixels; that is, a fixed size window in the selected image in the pyramid. A method for performing neural network-based face detection in a static photograph is described in "Neural Network-Based Face Detection", by H. A. Rowley et al. in IEEE Trans. on PAMI, 20(1):23-38, 1998 which is incorporated herein by reference in its entirety. The fixed size window is moved across the entire image one pixel at a time in order to search for a face in the image contained within the fixed size window.

FIG. 8 illustrates the different face sizes detected in each of the images 700, 702a-e, in the pyramid shown in FIG. 7. The smallest face is detected by searching for a face in the fixed size window 810 in the input image 700 and the largest face is detected by searching for a face in the fixed size window 810 in the smallest image 702e. Other size faces are detected by searching in the fixed size window in the other images 702a-d. The fixed size window 810 is passed over the image 700, 702a-e one pixel at a time. Returning to FIG. 6, processing continues with step 608.

At step 608, if a face is detected within the fixed size window 810 in an image 700, 702a-e in the pyramid, processing continues with step 610. If not, processing continues with step 614.

At step 610, the location of the detected face in the selected image 700, 702a-e with respect to the input image 700 is stored. Processing continues with step 612.

At step 612, the number of detected faces in the input image 700 is incremented. Processing continues with step 614.

At step 614, if the last set of pixels in the frame has not been searched, processing continues with step 606 to check the next set of pixels in the frame. Otherwise, processing continues with step 616 to continue scanning the next level in the pyramid.

At step 616, if the last frame in the set of reduced scale frames has been checked, processing is complete. If not, processing continues with step 604 to select the next reduced scale frame.

The face detector is prone to false negatives and false positives. False negatives are mainly due to rotated, occluded or small faces. Such frames are less likely to be interesting and thus not likely to be a representative frame; therefore, false negatives are not detrimental to the key frame extraction process, unlike false positives. Thus, a tracking system is used to track faces in successive frames in order to reduce the number of false positives.

FIG. 9A illustrates a method for reducing false positives by tracking a detected face across several consecutive frames. Three consecutive frames 700a-c are tracked. Face 900 and face 902 were detected in frame 700a, face 904 and face 906 were detected in frame 700b, face 908 and face 910 were detected in frame 700c. It is assumed that a true face will be detected in the same region of the image in successive frames, so those which are not can be discarded as false positives. Detected faces 900, 904 and 910 overlap in the three consecutive frames 700a, 700b, 700c. Therefore, the face detected as 900, 904 and 910 is counted because it is assumed to be a true face. However, detected faces 902, 906 and 908 are not likely to be a true face; that is, they are false positives because they appear in different regions in each consecutive frame 700a-c. Thus, the number of actual faces detected in the frame is one instead of two. Tracking detected faces through consecutive frames reduces the number of false positives and thus increases the likelihood of finding a keyframe with people.

FIG. 9B is a flowchart illustrating the method for tracking a detected face. After face detection has been performed on all the frames in the sequence of frames as has already been described in conjunction with FIG. 6, a number of detected faces and the location of each detected face is stored for each frame.

At step 920, the number of faces to track is set equal to the number of detected faces for the current frame. Processing continues with step 922.

At step 922, the location of the detected face is compared with locations of detected faces in the previous sequential frame in the sequence of frames. If the location of the face in the current frame overlaps with the location of a detected face in the previous frame, the face may be a valid face and processing continues with step 924. If not, the detected face in the current frame is not a valid face and processing continues with step 926.

At step 924, the location of the face is compared with locations of detected faces in the next sequential frame in the sequence of frames. If the location of the face in the current frame overlaps with the location of a detected face in the next frame, the face is likely a valid face because it overlaps with the location of a detected face in the previous sequential frame and the next sequential frame from the current frame. Processing continues with step 928. If not, processing continues with step 926.

At step 926, an invalid face was detected; that is, the face is considered to be a false positive. Thus, the number of detected faces for the current frame is decremented. Processing continues with step 928.

At step 928, the number of faces to track is decremented. Processing continues with step 930.

At step 930, the number of faces to track is examined in order to determine if there are more detected faces to track in the current frame. If so, processing continues with step 922 to determine if the next detected face is valid. If not, processing is complete.

Returning to FIG. 5, at step 502 a pixel-wise frame difference number is calculated for each frame. A measure of the amount of difference between pixels in successive frames may be used to determine a shot boundary in the digital video file. The pixel-wise frame difference number is computed by calculating the difference in intensity between a pixel in the current frame and the intensity of the same pixel in the previous frame and adding the absolute value of the differences of all pixels. For successive frames in a shot, the pixel-wise frame difference is a low value because the number of pixels that change from frame to frame is low. A high value of pixel-wise frame difference indicates a possible shot boundary. The following equation is used to compute the pixel-wise frame difference number.

$$ SAD(k) = \sum_{i,j} \left| I(i,j,k) - I(i,j,k-1) \right| $$

where:

I(i, j, k) denotes the image intensity at pixel location (i, j) in frame k of the sequential frames.

I(i, j, k-1) denotes the image intensity at pixel location (i, j) in frame k-1 of the sequential frames.

SAD(k) denotes the Sum of Absolute Difference of the intensity of all pixels in frame k and frame k-1.

The pixel-wise frame difference value is susceptible to false detection of shot boundaries because it is sensitive to rapid changes in movement.

At step 506 another measure of motion activity is computed to reduce false detections of shot boundaries based on
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pixel-wise intensity difference. This measure of activity is 
based on a luminance histogram for the frame, that is, a plot 
of the luminance distribution in the frame, in which each 
pixel has a luminance value between 0 and 255. The 
cumulative distribution of the luminance histogram for the 
current frame and the previous frame are compared. The 
Kolmogorov-Smimov statistical test, a well known test in 
statistics, is used to compute the probability that the distri- 
bution of luminance histograms of frame k and frame k-1 
are the same. 



10 



At steps 510 and 512, a measure of forward and backward 
discontinuity is computed based on the pixel-wise frame 
difference between successive frames. The forward discon- 
tinuity measure is the difference between the current frame's 
pixel-wise frame difference and the next frame's pixel- wise 
frame differences. The current frame's pixel-wise difference 
may also be compared with more than one next frame's pixel 
wise frame difference and the maximum difference selected 
as the forward discontinuity. The equation is shown below: 

max 



Dik)-. 



\(CD{x,k)'CDix,k-l))\ 
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where: 

k is a frame 

x is the gray level value (x e[0, 255]) 

CD(x4c) is the cumulative distribution of the luminance 
histogram for frame k 

Pjtj(k) is the probability that the distribution of luminance 
histograms of frame k and frame k-1 are the same. 

In FIG. lOA a luminance histogram is shown for frame k. 
The number of pixels is on the y-axis. The pixel luminance 
values are on the x-axis. In FIG. lOB a luminance histogram 
is shown for frame k-1. As can be seen, the histograms for 
frame k-1 and frame k differ slightly for a number of pixel 
intensities. In FIG. IOC the cumulative distribution of lumi- 
nance for frame k is shown. FIG. lOD shows the cumulative 
distribution of luminance for frame k-1. As can be seen in 
FIG. lOE the difference between the cumulative distribution 
of luminance for frame k and the cumulative distribution of 
luminance for frame K-1 is small. is a single number 
computed for the frame with a value between 1 and 0 
dependent on the Kolmogorov-Smirnov statistical test. 

At step 504 in FIG. 5 a measure of spatial activity is
computed for the frame. The spatial activity is
measured by the entropy of the frame using the equation
below:

$$H(k) = -\sum_{x} p(x,k) \log p(x,k)$$

where:

p(x, k) is the probability of the gray-level value x in the
luminance histogram of frame k.

A high value of entropy indicates a frame with a high 
spatial content. A frame with a high spatial content has a flat 
histogram because the pixel luminance is spread out 
amongst all the possible pixel luminance values. A frame 
with a low spatial content has a histogram in which the 
luminance of all pixels centers around the same luminance
creating a histogram with a peak. For example, a frame 
including a boat in a lake on a cloudless day would have a 
histogram with a large portion of pixels centering around the 
color blue. 
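
The entropy measure can be sketched as follows. This is an illustration only: the function name is hypothetical, and the base-2 logarithm is an assumption (it gives a maximum of 8 for a 256-bin histogram).

```python
import math


def frame_entropy(histogram):
    """Entropy of a luminance histogram, in bits (base 2 is an assumption)."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram contains no pixels")
    entropy = 0.0
    for count in histogram:
        if count:  # empty bins contribute nothing to the sum
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

A flat histogram yields the highest value, and a histogram with a single occupied bin yields zero, matching the flat versus peaked distinction drawn above.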

Returning to FIG. 5, at step 508, a measure of the 
percentage of skin pixels is computed from a color histo- 
gram of the frame pixels. The color of each pixel in the 
frame is compared to a known distribution of skin-like or 
human flesh color. This measure is useful to indicate a frame 
likely to include skin, for example, to select a frame in a 
digital video file showing humans. 
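
The description does not specify the skin-color distribution, so the sketch below swaps in a stand-in: a fixed rectangular range in the Cb/Cr chrominance plane, which is a common rule of thumb and not the distribution used by the described system. The function names and the range limits are assumptions.

```python
def rgb_to_cbcr(r, g, b):
    """Full-range RGB to Cb/Cr chrominance (JFIF-style conversion)."""
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr


def is_skin_like(r, g, b, cb_range=(77, 127), cr_range=(133, 173)):
    """Assumed stand-in skin model: a fixed box in the Cb/Cr plane."""
    cb, cr = rgb_to_cbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]


def skin_percentage(pixels):
    """Percentage of (r, g, b) pixels classified as skin-like."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    skin = sum(1 for r, g, b in pixels if is_skin_like(r, g, b))
    return 100.0 * skin / len(pixels)
```

A trained color histogram or any other skin model could replace `is_skin_like` without changing how the percentage is computed.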
For the forward discontinuity measure computed at steps 510 and 512:

k is the current frame

D_for(k) is the forward discontinuity; m, the number of next frames compared, is typically 1 or 2.

A measure of backward discontinuity is the difference 
between the current frame's pixel-wise frame difference and 
the previous frame's pixel-wise frame difference. The current
frame's pixel-wise difference may also be compared
with more than one previous frame's pixel-wise frame
difference and the maximum difference selected as the 
backward discontinuity. The equation is shown below: 
$$D_{back}(k) = \max_{i = k-m, \ldots, k-1} \left( SAD(k) - SAD(i) \right)$$

where:

D_back(k) is the backward discontinuity

k is the current frame; typically m=1 or 2.

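
The two discontinuity measures can be sketched as below. The list `sad_values` is a hypothetical input in which entry k holds the pixel-wise frame difference of frame k. The forward measure is assumed to mirror the backward equation over the next m frames, and returning 0.0 when no neighbouring frame exists is also an assumption.

```python
def backward_discontinuity(sad_values, k, m=1):
    """Max of SAD(k) - SAD(i) over the m frames preceding frame k."""
    neighbours = sad_values[max(k - m, 0):k]
    if not neighbours:  # first frame: nothing to compare against
        return 0.0
    return max(sad_values[k] - s for s in neighbours)


def forward_discontinuity(sad_values, k, m=1):
    """Max of SAD(k) - SAD(i) over the m frames following frame k."""
    neighbours = sad_values[k + 1:k + 1 + m]
    if not neighbours:  # last frame: nothing to compare against
        return 0.0
    return max(sad_values[k] - s for s in neighbours)
```

A frame whose difference value spikes above both its predecessors and its successors produces large values for both measures.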
FIG. 11 illustrates a graph of pixel-wise difference values
for successive frames k. Returning to FIG. 5, at step 514, if
there is another frame to be processed, processing continues 
with step 500. If not, having computed all necessary mea- 
sures for the individual frames required for detection of shot 
boundaries and key shots, the system proceeds as follows. 

Step 402 in FIG. 4 is described in greater detail in
conjunction with FIG. 12. FIG. 12 is a flowchart illustrating
the steps for detecting shot boundaries in the digital video 
file. Two tests are used to determine if the current frame is 
a shot boundary. 

At step 800 test1 is applied using the frame measurements
computed in the steps illustrated in FIG. 5. Test1 performs
the following test:

$$\frac{\max\left(D_{back}(k), D_{for}(k)\right)}{P_{KS}(k)} > 2\sigma$$

where:

σ is the standard deviation of the pixel-wise frame difference.

Although the test relies on a ratio of D_back, D_for and P_KS,
the test may be performed on either one.

FIGS. 13A-13C illustrate the type of shot boundaries
detected by the two tests. The frames k are on the x-axis. A
value dependent on the pixel-wise frame difference and
P_KS is on the y-axis. As shown in FIG. 13A, test1 detects a
shot boundary between frames with a small D_back followed
by frames with a large D_for. This type of shot boundary
occurs when a shot with high motion activity is followed by
a shot with low motion activity. As shown in FIG. 13B, test1
also detects a shot boundary between frames with a large
D_back followed by frames with a small D_for. This type of shot
boundary occurs when a shot with a low motion activity is
followed by a shot with high motion activity. If a shot
boundary is not detected using test1, a second test, test2, is
performed at step 802. Test2 is applied using the frame
measurements computed in the steps illustrated in FIG. 5.
Test2 performs the following test:

$$\frac{\max\left(D_{back}(k), D_{for}(k)\right)}{P_{KS}(k)} > \sigma \quad \text{and} \quad \min\left(D_{back}(k), D_{for}(k)\right) \ldots$$

where:

σ is the standard deviation of the pixel-wise frame difference.

Test2 detects a shot boundary looking at both the maximum
and the minimum thresholds for D_back and D_for. The
max threshold is less than in test1 because of a higher
confidence in detecting a peak (minimum and maximum
value) instead of a step (minimum or maximum value). FIG.
13C illustrates a low motion activity shot followed by
another low motion activity shot. Test2 detects this shot
boundary.

If test1 or test2 is true the frame is labeled as a shot
boundary at step 804. Having reached the end of a shot the 
total measure of the shot is computed at step 806. The total 
measure of the shot preceding the shot boundary is com- 
puted to determine a measure of how interesting the shot is. 
Interesting shots may be determined by the amount of skin 
colored pixels, the entropy, the amount of motion activity, 
number of detected faces and the length of the shot. The 
amount of skin colored pixels is used to determine the most 
interesting shot because typically the most interesting shot in 
the digital video is the shot with humans in it. The entropy 
is used to determine the most interesting shot because a shot 
with a low distribution of pixel intensity typically does not 
have a lot of objects in it. The amount of motion activity is 
used to determine the most interesting shot because shots 
with a lot of motion activity indicate that they are important 
to the digital video. The length of the shot is used to 
determine the most interesting shot in a digital video 
because typically the camera will stop at a position longer at 
an interesting shot. 

The factors to compute an interesting shot may be given 
weights to reduce the emphasis on one or more of the 
measures dependent on the type of digital video file. For
example, a digital video with a lot of motion activity in all 
shots may reduce the emphasis on motion so as to select the 
most interesting shot from other parameters. The equation 
for computing the total shot measure is shown below: 

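$$Score(shot) = W_{SAD}\frac{MEDSAD}{\sigma_{SAD}} + W_{H}\frac{MEDH}{\sigma_{H}} + W_{S}\frac{MEDS}{\sigma_{S}} + W_{F}\frac{SUMF}{\sigma_{F}} + W_{T}\frac{T}{\sigma_{T}}$$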

where: 

MEDH is the median of entropy of all frames in the shot. 

MEDS is the median of skin pixels percentage of all 
frames in the shot. 

MEDSAD is the median of pixel-wise frame difference in 
all frames in the shot. 

SUMF is the sum of all faces detected in the shot. 

Score(shot) is the total measure of the shot 

T is the length of the shot expressed in seconds.

σ_H, σ_S, σ_SAD, σ_T and σ_F are the standard deviations of MEDH,
MEDS, MEDSAD, T and SUMF respectively, computed on a
training set.

W_H, W_S, W_SAD, W_T and W_F are weighting factors for H, S,
SAD, T and F.

The weighting factors W_H, W_S, W_SAD, W_T and W_F are
scaling factors for the respective measures and are selected
dependent on the reliability of the measure. A measure with 
a high degree of confidence has a higher weighting factor
than a measure with a low degree of confidence. The default
values for the weighting factors are as follows: W_H=2,
W_S=0.5, W_SAD=1, and W_T=1. The weighting factor
for entropy is highest because entropy is a reliable measure.
However, if MEDH falls below a threshold value, the total 
score for the shot is set to zero. The threshold value is 
typically 4. The weighting factor for percentage of skin color 
pixels is lowest because percentage of skin color pixels is 
not a reliable measure. The weighting factor for face detec- 
tion is higher than that for percentage of skin color pixels 
because face detection is a more reliable measure of people 
in a shot or frame than the percentage of skin color pixels.

The weighting factor for length of shot is modified for 
beginning and ending shots. Beginning and ending shots 
tend to be long shots but they are not interesting shots 
because they typically include text, such as an FBI warning 
at the beginning of the video and the credits at the end of the 
video. Thus, for the beginning and ending shots the weighting
factor for length of shot is decreased to zero or 0.2.

The weighting factors for length of shot and percentage of
skin color pixels are reduced if MEDSAD is greater than a
threshold. The weighting factor for length of shot is
decreased to 0.5 and the weighting factor for percentage of
skin pixels is decreased to 0.25 because it is not likely that 
a scene with a lot of motion will include people. Scenes 
including people usually have low motion because the 
camera moves slowly. 
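
The shot score and the weight adjustments described above can be sketched as follows. The class, the dictionary keys and the parameter names are hypothetical. The defaults of 1 for the shot-length and face weights are partly an assumption, and `sad_threshold` has no value in this description, so it must be supplied by the caller.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ShotStats:
    entropies: list         # H(k) for every frame in the shot
    skin_percentages: list  # S(k) for every frame in the shot
    sads: list              # SAD(k) for every frame in the shot
    face_count: int         # SUMF: faces detected over the shot
    length_seconds: float   # T


# Assumed defaults; the T and F values are not fully legible in the source.
DEFAULT_WEIGHTS = {"H": 2.0, "S": 0.5, "SAD": 1.0, "T": 1.0, "F": 1.0}


def shot_score(shot, sigma, weights=None, entropy_threshold=4.0,
               sad_threshold=None, is_first_or_last=False):
    """Weighted sum of normalised shot measures, with the adjustments above."""
    w = dict(DEFAULT_WEIGHTS if weights is None else weights)
    medh = median(shot.entropies)
    meds = median(shot.skin_percentages)
    medsad = median(shot.sads)
    if medh < entropy_threshold:
        return 0.0                    # low-entropy shots score zero
    if is_first_or_last:
        w["T"] = 0.0                  # long title or credit shots are not favoured
    if sad_threshold is not None and medsad > sad_threshold:
        w["T"] = min(w["T"], 0.5)     # high motion: trust length less
        w["S"] = min(w["S"], 0.25)    # high motion: trust skin colour less
    return (w["SAD"] * medsad / sigma["SAD"] + w["H"] * medh / sigma["H"]
            + w["S"] * meds / sigma["S"] + w["F"] * shot.face_count / sigma["F"]
            + w["T"] * shot.length_seconds / sigma["T"])
```

Passing the standard deviations as a dictionary keeps the training-set statistics separate from the per-shot measurements.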

Step 404 in FIG. 4 is described in greater detail in
conjunction with FIG. 14. FIG. 14 is a flowchart illustrating
the steps for selecting a keyshot. Knowing the shot bound- 
aries and the total measure for each shot, the most interesting 
shot is selected as the shot having the largest total measure. 

At step 1000, the keyshot detector determines if the
current frame in the video file is labeled a shot boundary. If
so, processing continues with step 1102. If not, processing
continues with step 1002. 

At step 1002, the keyshot detector compares the total 
measure stored for the current shot with the total measure 
stored for the key shot. If the total measure of the current 
shot is greater than the total measure of the key shot, 
processing continues with step 1004. If not, processing 
continues with step 1006. 

At step 1004, the current shot is selected as the key shot. 
Processing continues with step 1006. 

At step 1006, the keyshot detector determines if the 
current frame is the last frame in the video file. If so, 
processing of the shots in the video file is complete. If not, 
processing continues with step 1000. 

All frames in the video file are checked for shot boundaries
until the last frame is reached. The total measures of all
shots in the video file are compared and the shot with the
highest total measure is selected as the most interesting shot 
in the video file. 
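
The keyshot selection loop can be sketched as follows. The inputs are hypothetical: `is_boundary[k]` is true where frame k is labeled a shot boundary, and `shot_measure[k]` holds the total measure of the shot ending at that boundary. Treating the boundary frame as the last frame of its shot is an assumption.

```python
def select_key_shot(is_boundary, shot_measure):
    """Return ((first_frame, last_frame), measure) of the highest-scoring shot."""
    key_shot = None
    key_measure = float("-inf")
    shot_start = 0
    for k, boundary in enumerate(is_boundary):
        if not boundary:
            continue                      # keep scanning until a shot ends
        measure = shot_measure[k]
        if measure > key_measure:         # current shot beats the stored key shot
            key_measure = measure
            key_shot = (shot_start, k)
        shot_start = k + 1                # next shot begins after the boundary
    return key_shot, key_measure
```

If no frame is labeled as a boundary, the function returns `None` for the shot, so a caller would normally label the final frame as a boundary.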

Step 406 in FIG. 4 is described in greater detail in
conjunction with FIG. 15. FIG. 15 illustrates the steps for
selecting the most representative frame from the most inter- 
esting shot. 

At step 1100 a total measure for the frame is computed 
from the entropy of the frame, the percentage of skin color 
pixels, the number of faces detected and the pixel-wise 
frame difference calculated for the frame. The total measure 
of the frame favors the frame in the shot with the least 
motion activity because selecting a frame with the most
motion may result in the display of a fuzzy frame due to the
motion. Motion is not captured well by compression algorithms
often used on the digital video file located on the
WWW. The equation for computing the total frame measure
is provided below:



$$Score(frame) = w_{H}\frac{H(k)}{\sigma_{H}} + w_{SAD}\frac{\sigma_{SAD}}{SAD(k)} + w_{S}\frac{S(k)}{\sigma_{S}} + w_{F}\frac{F(k)}{\sigma_{F}}$$

where:

Score(frame) is the total frame measure.

H(k) is the entropy of frame k.

SAD(k) denotes the Sum of Absolute Difference of the
intensity of all pixels in frame k and frame k-1.

F(k) is the sum of the number of faces detected.

S(k) is the percentage of skin-color pixels.

σ_H is the standard deviation of H computed on a training
set.

σ_SAD is the standard deviation of SAD computed on a
training set.

σ_S is the standard deviation of S computed on a training
set.

σ_F is the standard deviation of F computed on a training
set.

w_H, w_SAD, w_S and w_F are weighting factors for H, SAD,
S and F.

The weighting factors are selected as discussed in conjunction
with FIG. 14. The most interesting frame
within the most interesting shot is the frame with the
greatest amount of entropy relative to the amount of
motion, that is, the frame having the greatest frame
measure value Score(frame) computed above. Processing
continues with step 1102.

At step 1102, the total frame measure of the current frame 
in the most interesting shot is compared with the 
keyframe measure stored for a previous frame or zero 
if the frame is the first frame to be examined in the most 
interesting shot. If the total frame measure is greater 
than the stored keyframe measure, processing continues
with step 1104. If not, processing continues with
step 1100.

At step 1104, the current frame is selected as the key
frame. Processing continues with step 1106.

At step 1106, the keyframe detector determines if the
current frame is a shot boundary. If so, processing continues
with step 1108. If not, processing continues with step 1100.

At step 1108, the key frame for the most interesting shot 
in the video file is selected for the video. The keyframe can 
be stored with the video. Processing is complete. 
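
The frame scoring and key frame selection can be sketched as follows. This is an illustration under stated assumptions: the motion term is taken as the training-set deviation divided by SAD(k), so that low motion raises the score; the default weights are reused from the shot score; and the names and the `eps` guard against a zero SAD are hypothetical.

```python
def frame_score(entropy, skin_pct, faces, sad, sigma, weights=None, eps=1e-9):
    """Total frame measure; the motion term is inverted so still frames score higher."""
    w = weights or {"H": 2.0, "S": 0.5, "SAD": 1.0, "F": 1.0}  # assumed defaults
    return (w["H"] * entropy / sigma["H"]
            + w["SAD"] * sigma["SAD"] / max(sad, eps)
            + w["S"] * skin_pct / sigma["S"]
            + w["F"] * faces / sigma["F"])


def select_key_frame(frames, sigma):
    """frames: (entropy, skin_pct, faces, sad) tuples for the most interesting shot."""
    best_index, best_score = None, 0.0   # the stored measure starts at zero
    for index, (entropy, skin_pct, faces, sad) in enumerate(frames):
        score = frame_score(entropy, skin_pct, faces, sad, sigma)
        if score > best_score:           # current frame becomes the key frame
            best_index, best_score = index, score
    return best_index, best_score
```

Between two frames with equal entropy, skin percentage and face count, the one with the smaller pixel-wise difference is chosen.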

Returning to FIG. 2, after the key frame has been output
from the keyframer daemon 110 at step 206, the mover
daemon 116 moves the key frame to the index of
multimedia files 118 at step 210. At step 212 the reaper
daemon 120 deletes the common format video file.

In another embodiment of the present invention more than
one keyframe may be output by selecting a keyframe from 
each of a number of the most interesting shots. 

It will be apparent to those of ordinary skill in the art that 
methods involved in the present system may be embodied in 
a computer software program product that includes a com- 
puter usable medium. For example, such a computer usable 
medium can include a readable memory device, such as a 
solid state memory, hard drive device, a CD-ROM, a DVD- 
ROM or a computer diskette, having computer readable 
software program code segments stored thereon. The com- 
puter readable medium can also include a communications 
or transmission medium, such as a bus or communications 
link, either wired, optical or wireless having software program
code segments carried thereon as digital or analog data
signals. 

While this invention has been particularly shown and 
described with references to preferred embodiments thereof, 
it will be understood by those skilled in the art that various 
changes in form and details may be made therein without 
departing from the spirit and scope of the invention as 
defined by the appended claims. 

What is claimed is:

1. A method of extracting a single representative key 
frame from a sequence of frames, the sequence of frames 
including a plurality of shots, comprising the steps of: 

performing face detection in the sequence of frames 
comprising the steps of: 

creating a set of images for each frame in the sequence 
of frames with each image in the set of images 
smaller than the previous image; and 
searching for faces having at least a minimum size in a 
selected portion of the set of images; 
detecting shot boundaries in the sequence of frames to
identify shots within the detected shot boundaries;
selecting a most interesting shot from the identified shots 
based on a number of detected faces in the shot; and 
selecting the single representative key frame representa- 
tive of the sequence of frames from the selected shot 
based on a number of detected faces in the frame. 

2. The method of claim 1 wherein the selected portion of 
the set of images is based on the minimum size face to be 
detected. 

3. The method as claimed in claim 1 wherein the images 
are smaller by the same scale factor.

4. The method as claimed in claim 3 further comprising 
the step of: 

selecting the scale factor dependent on the size of the 
frame. 

5. The method as claimed in claim 1 further comprising 
the step of: 

tracking overlap of a detected face in consecutive frames 
in order to filter detected faces which are not likely to 
be valid.

6. The method as claimed in claim 1 wherein the step of 
selecting a most interesting shot includes providing a shot 
score based on a set of measures selected from the group 
consisting of motion between frames, amount of skin color 
pixels, shot length and detected faces. 

7. The method as claimed in claim 6 wherein each 
measure includes a respective weighting factor. 

8. The method as claimed in claim 7 wherein the weight- 
ing factor is dependent on the level of confidence of the 
measure. 

9. The method as claimed in claim 1 wherein the step of 
performing face detection uses a neural network-based algo- 
rithm. 

10. An apparatus for extracting a single representative key 
frame from a sequence of frames comprising: 

means for performing face detection in the sequence of 
frames, the means for performing comprising: 

means for creating a set of images for the frame with each 
image in the set of images smaller than the previous 
image; and 

means for searching for faces having at least a minimum 
size in a selected portion of the set of images; 

means for detecting shot boundaries in the sequence of 
frames to identify shots within shot boundaries; 
means for selecting a most interesting shot from the
identified shots based on a number of detected faces in
the shot; and

means for selecting the single representative
key frame representative of the sequence of
frames from the selected shot based on a number of
detected faces in the frame.

11. The apparatus as claimed in claim 10 wherein the 
selected portion of the set of images is based on the 
minimum size face to be detected. 

12. The apparatus as claimed in claim 10 wherein the
images are smaller by the same scale factor. 

13. The apparatus as claimed in claim 12 further com- 
prising: 

means for selecting the scale factor dependent on the size 
of the frame. 

14. The apparatus as claimed in claim 10 further com- 
prising: 

means for tracking overlap of a detected face in consecutive
frames to filter detected faces which are not likely
to be valid. 

15. The apparatus as claimed in claim 10 wherein the 
means for selecting a most interesting shot comprises: 

means for providing a shot score based on a set of 
measures selected from the group consisting of motion 
between frames, amount of skin color pixels, shot 
length and detected faces. 

16. The apparatus as claimed in claim 15 wherein each 
measure includes a respective weighting factor. 

17. The apparatus as claimed in claim 16 wherein the 
weighting factor is dependent on the level of confidence of 
the measure. 

18. The apparatus as claimed in claim 10 wherein the 
means for performing face detection uses a neural network-based
algorithm.

19. An apparatus for extracting a single representative key 
frame from a sequence of frames comprising: 

a face detector which performs face detection in the 
sequence of frames the face detector including: 
an image creator which creates a set of images for the
frame with each image in the set of images smaller
than the previous image; and
a face searcher which searches for faces having at least
a minimum size in a selected portion of the set of
images; and
a key frame selector which selects a key frame representative
of the sequence of frames from the sequence of
frames based on a number of detected faces in the
frame.

20. The apparatus as claimed in claim 19 wherein the
selected portion of the set of images is based on the size of 
the face to be detected. 

21. The apparatus as claimed in claim 19 wherein the 
images are smaller by the same scale factor. 

22. The apparatus as claimed in claim 21 further comprising:

a frame sampler which selects the scale factor dependent 
on the size of the frame. 

23. The apparatus as claimed in claim 19 further com- 
prising: 

a face tracker which tracks a detected face through 
consecutive frames to filter detected faces which are 
not likely to be valid. 
24. The apparatus as claimed in claim 19 wherein the key
shot detector comprises: 

a shot score generator which generates a shot score for 
based on a set of measures selected from the group 
consisting of motion between frames, amount of skin 
color pixels, shot length and detected faces. 

25. The apparatus as claimed in claim 24 wherein each 
measure includes a respective weighting factor. 

26. The apparatus as claimed in claim 25 wherein the 
weighting factor is dependent on the level of confidence of 
the measure. 

27. The apparatus as claimed in claim 19 wherein the face 
detector uses a neural network-based algorithm. 

28. A computer system comprising: 

a memory system storing a sequence of frames; and 

a face detector which performs face detection in the 
sequence of frames, the face detector comprising: 
an image creator which creates a set of images for the 
frame with each image in the set of images smaller 
than the previous image; and 

a face searcher which searches for faces having at least a 
minimum size in a selected portion of the set of images; 

a shot boundary detector which detects shot boundaries to 
identify shots within the detected shot boundaries; and 

a key shot selector which selects a most interesting shot 
from the identified shots based on a number of detected 
faces in the shot; and 

a key frame selector which selects the single representa- 
tive key frame representative of the sequence of frames 
from the selected shot based on a number of detected 
faces in the frame. 

29. An article of manufacture comprising: 

a computer-readable medium for use in a computer hav- 
ing a memory; 

a computer-implementable software program recorded on
the medium for extracting a single representative key 
frame from a sequence of frames, the sequence of 
frames including a plurality of shots, the computer 
implemented software program comprising instruc- 
tions for: 

performing face detection in the sequence of frames 
comprising the steps of: 

creating a set of images for each frame in the sequence 
of frames with each image in the set of images 
smaller than the previous image; and 

searching for faces having at least a minimum size in a 
selected portion of the set of images; 

detecting shot boundaries in the sequence of frames to 
identify shots within the detected shot boundaries; 

selecting a most interesting shot from the identified 
shots based on a number of detected faces in the shot; 
and 

selecting the single representative key frame representative
of the sequence of frames from the selected
shot based on a number of detected faces in the 
frame. 
